KAFKA-9987: optimize sticky assignment algorithm for same-subscription case #8668

Conversation

@ableegoldman ableegoldman (Contributor) commented May 14, 2020

Motivation and pseudo code algorithm in the ticket.

Added a scale test with a large number of topic partitions and consumers, and a 30s timeout.
With these changes, an assignment with 2,000 consumers and 200 topics of 2,000 partitions each completes within a few seconds.

Porting the same test to trunk, it took 2 minutes even with a 100x reduction in the number of topics (i.e., 2 minutes for 2,000 consumers and 2 topics with 2,000 partitions each).

Should be cherry-picked to 2.6, 2.5, and 2.4

@@ -303,79 +469,17 @@ private int getBalanceScore(Map<String, List<TopicPartition>> assignment) {
Map<String, List<TopicPartition>> consumer2AllPotentialPartitions) {
List<TopicPartition> sortedPartitions = new ArrayList<>();

if (!isFreshAssignment && areSubscriptionsIdentical(partition2AllPotentialConsumers, consumer2AllPotentialPartitions)) {
Contributor Author (ableegoldman):

We can remove all of this: we already checked for identical subscriptions at the start, so by the time we get here we know they are not identical.
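To make the point concrete, the up-front check can be as simple as comparing every consumer's topic set against the first one. This is a minimal sketch under assumed names (allSubscriptionsIdentical and consumerToTopics are illustrative, not the actual helpers in AbstractStickyAssignor):

import java.util.*;

// Illustrative sketch only: method and variable names are assumptions, not the PR's actual code.
private static boolean allSubscriptionsIdentical(Map<String, List<String>> consumerToTopics) {
    Set<String> firstSubscription = null;
    for (List<String> topics : consumerToTopics.values()) {
        Set<String> subscription = new HashSet<>(topics);
        if (firstSubscription == null) {
            firstSubscription = subscription;   // remember the first consumer's topic set
        } else if (!firstSubscription.equals(subscription)) {
            return false;                       // any mismatch forces the general-case algorithm
        }
    }
    return true;
}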

@ableegoldman ableegoldman force-pushed the 9987-efficient-CooperativeStickyAssignor branch from 635f8f0 to afbed93 Compare May 29, 2020 01:00
@@ -169,10 +169,10 @@ public void testAssignmentWithConflictingPreviousGenerations() {
TopicPartition tp5 = new TopicPartition(topic, 5);

List<TopicPartition> c1partitions0 = partitions(tp0, tp1, tp4);
-    List<TopicPartition> c2partitions0 = partitions(tp0, tp2, tp3);
+    List<TopicPartition> c2partitions0 = partitions(tp0, tp1, tp2);
Contributor Author (ableegoldman):

This test was testing an illegal state to begin with: two consumers in the same generation should never both claim to own the same partition. That is the entire reason the generation field was added to the StickyAssignor's subscription userdata in the first place.

@@ -582,35 +578,6 @@ public void testNoExceptionThrownWhenOnlySubscribedTopicDeleted() {
assertTrue(assignment.get(consumerId).isEmpty());
}

@Test
public void testConflictingPreviousAssignments() {
Contributor Author (ableegoldman):

See the comment above: this test was starting from an illegal state. It also doesn't make sense to place it in AbstractStickyAssignorTest, since the cooperative assignor can't have conflicting previous assignments: if a member thinks it still owns a partition that now belongs to another member, it will have to invoke onPartitionsLost before rejoining the group.

@@ -425,8 +422,36 @@ public void testSameSubscriptions() {
assertTrue(assignor.isSticky());
}

@Test(timeout = 30 * 1000)
public void testLargeAssignmentAndGroupWithUniformSubscription() {
int topicCount = 200;
Contributor Author (ableegoldman):

On trunk, this test fails (hits the 30s timeout) even when you reduce the number of topics to just 1
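For context, a rough sketch of what such a scale test can look like (the constants match the snippet above, but the setup and the exact assertions are assumptions, not the test as merged):

@Test(timeout = 30 * 1000)
public void testLargeAssignmentAndGroupWithUniformSubscription() {
    // Illustrative sketch: the partitionsPerTopic / Subscription setup approximates the real test.
    int topicCount = 200;
    int partitionCount = 2_000;
    int consumerCount = 2_000;

    Map<String, Integer> partitionsPerTopic = new HashMap<>();
    List<String> allTopics = new ArrayList<>();
    for (int i = 0; i < topicCount; i++) {
        String topic = "topic-" + i;
        allTopics.add(topic);
        partitionsPerTopic.put(topic, partitionCount);
    }

    Map<String, Subscription> subscriptions = new HashMap<>();
    for (int i = 0; i < consumerCount; i++)
        subscriptions.put("consumer-" + i, new Subscription(allTopics));

    // The assignment must complete before the 30s timeout on the @Test annotation fires.
    assignor.assign(partitionsPerTopic, subscriptions);
}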

@ableegoldman (Contributor Author):

Calling for review from any of @guozhangwang @hachikuji @vvcephei.

@guozhangwang (Contributor):

test this please

@guozhangwang guozhangwang (Contributor) left a comment

LGTM. Just one nit comment.

// other generations are, consider it as having lost its owned partition
if (!memberData.generation.isPresent() && maxGeneration > 0
|| memberData.generation.isPresent() && memberData.generation.get() < maxGeneration) {
consumerToOwnedPartitions.put(consumer, new ArrayList<>());
Contributor (guozhangwang):

nit: to be consistent, we can just add the consumer to membersWithOldGeneration and then let them be cleared at the end.

Contributor Author (ableegoldman):

Hm, it seems odd to clear it at the end since it's definitely already empty. Note that we're not overwriting the current partitions with an empty list; we're just initializing the assignment for this consumer. I'll add a comment, though.
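As a sketch of the two options being discussed (simplified from the snippet above; the commented-out alternative paraphrases the review suggestion and is not actual PR code):

// Option taken here: a member on an older generation never has its previous ownership recorded,
// so its entry is simply initialized to an empty list up front (nothing is being overwritten).
if (!memberData.generation.isPresent() && maxGeneration > 0
        || memberData.generation.isPresent() && memberData.generation.get() < maxGeneration) {
    consumerToOwnedPartitions.put(consumer, new ArrayList<>());
}

// Suggested alternative (paraphrased): collect such consumers into membersWithOldGeneration and
// clear their entries in one place at the end, even though those lists would already be empty:
//     membersWithOldGeneration.add(consumer);
//     ...
//     for (String member : membersWithOldGeneration)
//         consumerToOwnedPartitions.get(member).clear();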

for (String consumer : unfilledMembers) {
List<TopicPartition> consumerAssignment = assignment.get(consumer);
int remainingCapacity = minQuota - consumerAssignment.size();
while (remainingCapacity > 0) {
Contributor (guozhangwang):

Is it possible that this unfilled consumer has N+1 remaining capacity, while there are only N partitions remaining to assign?

Contributor (guozhangwang):

Never mind, I realized it should never happen.
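For reference, the filling step roughly looks like the sketch below (simplified; unassignedPartitions and minQuota are assumed to come from earlier steps of the algorithm). Per the exchange above, an unfilled member can never need more partitions than remain unassigned, so the iterator never runs dry:

// Simplified sketch of the fill loop quoted above, not the exact PR code.
Iterator<TopicPartition> unassignedIter = unassignedPartitions.iterator();
for (String consumer : unfilledMembers) {
    List<TopicPartition> consumerAssignment = assignment.get(consumer);
    int remainingCapacity = minQuota - consumerAssignment.size();
    while (remainingCapacity > 0) {
        // safe per the discussion above: enough unassigned partitions always remain
        TopicPartition unassigned = unassignedIter.next();
        consumerAssignment.add(unassigned);
        remainingCapacity--;
    }
}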


// Keep track of the partitions being migrated from one consumer to another during assignment
// so the cooperative assignor can adjust the assignment
protected Map<TopicPartition, String> partitionsTransferringOwnership = new HashMap<>();
Contributor Author (ableegoldman):

This is just an optimization for the cooperative case: I found that the assignment times for the eager and cooperative assignors began to diverge once you reached partition counts in the millions. At 10 million partitions, for example, the eager assignor hovered around 30s while the cooperative assignor took upwards of 5-6 minutes.
The discrepancy was entirely due to the adjustAssignment method needing to compute the set of partitions transferring ownership from the completed assignment. But we can build up this map during assignment much more efficiently, by taking advantage of the additional context we have at various steps in the algorithm. Tracking this set during assignment and exposing it to the cooperative assignor cut the assignment time for large partition counts drastically, putting the cooperative assignor on par with the eager assignor.
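As a rough illustration of the idea (a sketch, not the literal PR code): at the moment the algorithm gives a partition to a consumer other than its previous owner, it already knows both parties, so it can record the transfer on the spot instead of diffing the old and new assignments afterwards. The previousOwner parameter below is an assumed lookup, not an actual field name:

// Illustrative sketch: record an ownership transfer at the moment it is decided.
protected Map<TopicPartition, String> partitionsTransferringOwnership = new HashMap<>();

private void assignPartition(TopicPartition partition, String newConsumer, String previousOwner,
                             Map<String, List<TopicPartition>> assignment) {
    assignment.get(newConsumer).add(partition);
    if (previousOwner != null && !previousOwner.equals(newConsumer)) {
        // the cooperative assignor uses this map to ensure the partition is revoked from its
        // previous owner before the new consumer is allowed to own it
        partitionsTransferringOwnership.put(partition, newConsumer);
    }
}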

} else {
log.debug("Detected that all not consumers were subscribed to same set of topics, falling back to the "
+ "general case assignment algorithm");
partitionsTransferringOwnership = null;
Contributor Author (ableegoldman):

I didn't bother to include this optimization for the general case. We know that the general assignment algorithm itself becomes a bottleneck at only 2,000 partitions, so there's no point optimizing something that only becomes a bottleneck at millions of partitions.
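For the general case, the map is set to null and the transfers can still be recovered afterwards by diffing ownership, roughly as in this sketch (previousOwnerOf and newAssignment are assumed inputs, not the actual field names):

// Illustrative fallback when partitionsTransferringOwnership was not built during assignment:
// recompute the transfers by comparing each partition's previous owner to its new owner.
Map<TopicPartition, String> transferring = new HashMap<>();
for (Map.Entry<String, List<TopicPartition>> entry : newAssignment.entrySet()) {
    String newOwner = entry.getKey();
    for (TopicPartition tp : entry.getValue()) {
        String previousOwner = previousOwnerOf.get(tp);   // assumed: Map<TopicPartition, String>
        if (previousOwner != null && !previousOwner.equals(newOwner))
            transferring.put(tp, newOwner);
    }
}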

Comment on lines -108 to -110
partitionsPerTopic.put(topic, 3);
partitionsPerTopic.put(otherTopic, 3);
subscriptions = Collections.singletonMap(consumerId, new Subscription(topics(topic)));
Contributor Author (ableegoldman):

This test was also starting from an illegal state -- partitionsPerTopic only contains metadata for topics included in the subscription. I noticed that we don't seem to be testing the actual valid case, where some consumers have ownedPartitions that are no longer in the subscription, so I just adapted this test for that related purpose.

@ableegoldman (Contributor Author):

@guozhangwang made a few more changes, ready for another review

@twmb commented Jun 1, 2020

From what I can tell, this looks good to me. It loses one mostly insignificant "optimization" that does not really matter in practice: previously, if an old-generation member was rejoining, the code would try to re-sticky partitions to those old members for any partitions that are now on overloaded members or are unassigned. This is a pretty minor optimization, though, and deleting this logic entirely from my own balancer breaks no tests.

This algorithm primarily differs from mine by doing a bunch of up front checking work, and then doing a "single" pass that performs all assignments. Mine does a bunch of assigning while doing checks, and then does a small balancing pass. Both of these options are great, though!

Pretty nifty observation about building partitionsTransferringOwnership while doing assignment. I'm going to have to figure out if that's even possible with my approach--your algorithm can do that because of its one pass, whereas mine loses some context of who started with what by the time it gets to balancing.

@guozhangwang guozhangwang (Contributor) left a comment

LGTM on the new optimization.

@ableegoldman (Contributor Author):

@twmb yeah, I should point out in the ticket that this approach drops the optimization giving preference to older-generation owners of a partition. I actually don't think it would be particularly difficult to incorporate into this new algorithm, but my take was that it still adds more complexity than any benefit it provides.
We actually dropped this implicitly in the cooperative assignor, since a member with an older generation will have to give up all of its owned partitions before rejoining the group anyway.

It was nice to be able to build up the partitionsTransferringOwnership with the additional context we have while crafting the assignment, but to be fair it may be somewhat of an over-optimization at this point. The adjustAssignment loop that was building it up from scratch still performed fine up to ~5-10 million partitions. But I figure, better to optimize now and not have to worry about it later 🙂

@guozhangwang guozhangwang merged commit c6633a1 into apache:trunk Jun 1, 2020
guozhangwang pushed three commits that referenced this pull request Jun 1, 2020, each with the same message:

KAFKA-9987: optimize sticky assignment algorithm for same-subscription case (#8668)

Motivation and pseudo code algorithm in the ticket.

Added a scale test with a large number of topic partitions and consumers, and a 30s timeout.
With these changes, an assignment with 2,000 consumers and 200 topics of 2,000 partitions each completes within a few seconds.

Porting the same test to trunk, it took 2 minutes even with a 100x reduction in the number of topics (i.e., 2 minutes for 2,000 consumers and 2 topics with 2,000 partitions each).

Should be cherry-picked to 2.6, 2.5, and 2.4

Reviewers: Guozhang Wang <wangguoz@gmail.com>
@guozhangwang (Contributor):

Cherry-picked to 2.6/2.5/2.4.

@ijuma (Contributor) commented Jun 2, 2020

Did we check the build before merging this? It seems to have broken it:
#8779

@ijuma (Contributor) commented Jun 2, 2020

@guozhangwang Looks like 2.6, 2.5 and 2.4 are broken too. You should generally also build locally when cherry-picking.

@ableegoldman (Contributor Author):

Sorry @ijuma, I think I only ever ran the local tests + checkstyle, not the full suite. My mistake

// If the current member's generation is higher, all the previous owned partitions are invalid
if (memberData.generation.isPresent() && memberData.generation.get() > maxGeneration) {
membersWithOldGeneration.addAll(membersOfCurrentHighestGeneration);
membersOfCurrentHighestGeneration.clear();
Contributor Author (ableegoldman):

Just FYI, I introduced this bug right before merging. Luckily the tests caught it -- fix is #8777
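For context, the intent of this block, paraphrased as a sketch (this is only an illustration of the generation bookkeeping, not the #8777 fix itself):

// Illustrative paraphrase of the generation bookkeeping around the quoted lines.
if (memberData.generation.isPresent() && memberData.generation.get() > maxGeneration) {
    // a strictly higher generation invalidates the ownership claims of every member
    // previously thought to be on the highest generation
    membersWithOldGeneration.addAll(membersOfCurrentHighestGeneration);
    membersOfCurrentHighestGeneration.clear();
    maxGeneration = memberData.generation.get();   // remember the new highest generation seen
}
membersOfCurrentHighestGeneration.add(consumer);   // this member is on the highest generation so far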

@ijuma (Contributor) commented Jun 2, 2020

@ableegoldman No worries, I do the same. We just need to check the PR result before merging. Additionally, committers should run checkstyle and spotBugs when cherry-picking to older branches.

ijuma added a commit to confluentinc/kafka that referenced this pull request Jun 3, 2020
* apache-github/2.6: (32 commits)
  KAFKA-10083: fix failed testReassignmentWithRandomSubscriptionsAndChanges tests (apache#8786)
  KAFKA-9945: TopicCommand should support --if-exists and --if-not-exists when --bootstrap-server is used (apache#8737)
  KAFKA-9320: Enable TLSv1.3 by default (KIP-573) (apache#8695)
  KAFKA-10082: Fix the failed testMultiConsumerStickyAssignment (apache#8777)
  MINOR: Remove unused variable to fix spotBugs failure (apache#8779)
  MINOR: ChangelogReader should poll for duration 0 for standby restore (apache#8773)
  KAFKA-10030: Allow fetching a key from a single partition (apache#8706)
  Kafka-10064 Add documentation for KIP-571 (apache#8760)
  MINOR: Code cleanup and assertion message fixes in Connect integration tests (apache#8750)
  KAFKA-9987: optimize sticky assignment algorithm for same-subscription case (apache#8668)
  KAFKA-9392; Clarify deleteAcls javadoc and add test for create/delete timing (apache#7956)
  KAFKA-10074: Improve performance of `matchingAcls` (apache#8769)
  KAFKA-9494; Include additional metadata information in DescribeConfig response (KIP-569) (apache#8723)
  KAFKA-10056; Ensure consumer metadata contains new topics on subscription change (apache#8739)
  KAFKA-10029; Don't update completedReceives when channels are closed to avoid ConcurrentModificationException (apache#8705)
  KAFKA-10061; Fix flaky `ReassignPartitionsIntegrationTest.testCancellation` (apache#8749)
  KAFKA-9130; KIP-518 Allow listing consumer groups per state (apache#8238)
  KAFKA-9501: convert between active and standby without closing stores (apache#8248)
  MINOR: Relax Percentiles test (apache#8748)
  MINOR: regression test for task assignor config (apache#8743)
  ...
@ableegoldman ableegoldman deleted the 9987-efficient-CooperativeStickyAssignor branch June 26, 2020 22:37